NSF PAR Search | NSF Public Access Repository

Optimizing genomic prediction for complex traits via investigating multiple factors in switchgrass

https://doi.org/10.1093/plphys/kiaf188

Wang, Peipei; Meng, Fanrui; Del Azodi, Christina B; Abá, Kenia Segura; Casler, Michael D; Shiu, Shin-Han (May 2025, Plant Physiology)

Abstract. Genomic prediction has accelerated breeding processes and provided mechanistic insights into the genetic bases of complex traits. To further optimize genomic prediction, we assess the impact of genome assemblies, genotyping approaches, variant types, allelic complexities, polyploidy levels, and population structures on the prediction of 20 complex traits in switchgrass (Panicum virgatum L.), a perennial biofuel feedstock. Surprisingly, short read-based genome assembly performs comparably to or even better than long read-based assembly. Due to higher gene coverage, exome capture and multi-allelic variants outperform genotyping-by-sequencing and bi-allelic variants, respectively. Tetraploid models show higher prediction accuracy than octoploid models for most traits, likely due to the greater genetic distances among tetraploids. Depending on the trait in question, different types of variants need to be integrated for optimal predictions. Our study provides insights into the factors influencing genomic prediction outcomes, guiding best practices for future studies and for improving agronomic traits in switchgrass and other species through selective breeding.
Free, publicly-accessible full text available May 7, 2026
Impact of short-read sequencing on the misassembly of a plant genome

https://doi.org/10.1186/s12864-021-07397-5

Wang, Peipei; Meng, Fanrui; Moore, Bethany M.; Shiu, Shin-Han (December 2021, BMC Genomics)
Abstract. Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads, and the generality of these causes and factors should be tested further in other species.
Full Text Available
Predictive models of genetic redundancy in Arabidopsis thaliana

https://doi.org/10.1093/molbev/msab111

Cusack, Siobhan A; Wang, Peipei; Lotreck, Serena G; Moore, Bethany M; Meng, Fanrui; Conner, Jeffrey K; Krysan, Patrick J; Lehti-Shiu, Melissa D; Shiu, Shin-Han (April 2021, Molecular Biology and Evolution)
de Meaux, Juliette (Ed.)
Abstract. Genetic redundancy refers to a situation where an individual with a loss-of-function mutation in one gene (single mutant) does not show an apparent phenotype until one or more paralogs are also knocked out (double/higher-order mutant). Previous studies have identified some characteristics common among redundant gene pairs, but a predictive model of genetic redundancy incorporating a wide variety of features derived from accumulating omics and mutant phenotype data is yet to be established. In addition, the relative importance of these features for genetic redundancy remains largely unclear. Here, we establish machine learning models for predicting whether a gene pair is likely redundant or not in the model plant Arabidopsis thaliana based on six feature categories: functional annotations, evolutionary conservation including duplication patterns and mechanisms, epigenetic marks, protein properties including post-translational modifications, gene expression, and gene network properties. The definition of redundancy, data transformations, feature subsets, and machine learning algorithms used significantly affected model performance based on hold-out, testing phenotype data. Among the most important features in predicting gene pairs as redundant were having a paralog(s) from recent duplication events, annotation as a transcription factor, downregulation during stress conditions, and having similar expression patterns under stress conditions. We also explored the potential reasons underlying mispredictions and limitations of our studies. This genetic redundancy model sheds light on characteristics that may contribute to long-term maintenance of paralogs, and will ultimately allow for more targeted generation of functionally informative double mutants, advancing functional genomic studies.
Full Text Available
Factors Influencing Gene Family Size Variation Among Related Species in a Plant Family, Solanaceae

https://doi.org/10.1093/gbe/evy193

Wang, Peipei; Moore, Bethany M; Panchy, Nicholas L; Meng, Fanrui; Lehti-Shiu, Melissa D; Shiu, Shin-Han; Van De Peer, Yves (September 2018, Genome Biology and Evolution)

Full Text Available
High‐throughput measurement of plant fitness traits with an object detection method using Faster R‐CNN

https://doi.org/10.1111/nph.18056

Wang, Peipei; Meng, Fanrui; Donaldson, Paityn; Horan, Sarah; Panchy, Nicholas L.; Vischulis, Elyse; Winship, Eamon; Conner, Jeffrey K.; Krysan, Patrick J.; Shiu, Shin‐Han; et al (March 2022, New Phytologist)

Summary. Revealing the contributions of genes to plant phenotype is frequently challenging because loss‐of‐function effects may be subtle or masked by varying degrees of genetic redundancy. Such effects can potentially be detected by measuring plant fitness, which reflects the cumulative effects of genetic changes over the lifetime of a plant. However, fitness is challenging to measure accurately, particularly in species with high fecundity and relatively small propagule sizes such as Arabidopsis thaliana. An image segmentation‐based method using the software ImageJ and an object detection‐based method using the Faster Region‐based Convolutional Neural Network (R‐CNN) algorithm were used for measuring two Arabidopsis fitness traits: seed and fruit counts. The segmentation‐based method was error‐prone (correlation between true and predicted seed counts, r² = 0.849) because seeds touching each other were undercounted. By contrast, the object detection‐based algorithm yielded near perfect seed counts (r² = 0.9996) and highly accurate fruit counts (r² = 0.980). Comparing seed counts for wild‐type and 12 mutant lines revealed fitness effects for three genes; fruit counts revealed the same effects for two genes. Our study provides analysis pipelines and models to facilitate the investigation of Arabidopsis fitness traits and demonstrates the importance of examining fitness traits when studying gene functions.